Summary

  • It is vital that you draw a quick sketch of the Normal curve that you are considering, before you attempt any calculations.
  • We use R to find an area under the Normal curve. Historically, ‘Stats Tables’ were used.
  • The area under the curve up to a point \(x\) represents the chance (or “probability”) of getting a value less than or equal to \(x\).

1 Import data and more plotting tricks

1.1 Data

We continue using the data set collected in MATH1005 in S2, 2022. Make sure you have the data file math1005_cleaned.csv in the data folder inside your STAT5002 folder.

  • Import the data into a variable called math1005. Then we isolate the height variable and remove NAs from it.
math1005 = read.csv("data/math1005_cleaned.csv")
str(math1005)
## 'data.frame':    281 obs. of  8 variables:
##  $ Gender       : chr  "Female" "Female" "Female" "Male" ...
##  $ International: chr  "Domestic" "Domestic" "International" "Domestic" ...
##  $ Major        : chr  "Physics, Chemistry" "Biomedical Engineering" "Statistics" "Transport" ...
##  $ Height       : num  154 182 156 172 193 167 NA 183 150 181 ...
##  $ ShoeSize     : num  9.5 9 9 8 13 40.5 NA 11 7 44 ...
##  $ Age          : int  18 22 18 29 18 19 19 18 18 20 ...
##  $ Country      : chr  "Australia" "Australia" "India" "Nepal" ...
##  $ Language     : chr  "English " "English" "Hindi" "Nepali" ...
height = na.omit(math1005$Height)

1.2 Plot normal curve

The code curve(dnorm(x, m, s), xlim = c(a, b), add=TRUE) plots a normal curve with mean m and SD s over the interval \((a,b)\) on the x axis. The option add=TRUE lets you add the normal curve to an existing plot. See the following example.

### here is an example
curve(dnorm(x, 0, 0.5), xlim = c(-10, 5), col = "red")
curve(dnorm(x, -2, 2), xlim = c(-10, 5), col = "blue", add=TRUE)

1.3 Plotting area under a normal curve

The following code sketches a Normal Curve with mean 40 and SD 15. The red shaded area represents \(P(X < 18)\). The blue shaded area represents \(P(60 < X < 80)\).

curve(dnorm(x,40,15),from=-20,to=100,ylab="Density",main="N(40,225)")
x = seq(-3.5,3.5,length=1000)*15 + 40
y = dnorm(x,40,15)
y18 = dnorm(18, 40, 15)
polygon(c(min(x), x[x<18], 18, 18), c(0, y[x<18], y18 , 0), col="red")
y60 = dnorm(60, 40, 15)
y80 = dnorm(80, 40, 15)
polygon(c(60, 60, x[x>60&x<80], 80, 80), c(0, y60, y[x>60&x<80], y80, 0), col="blue")
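
The two shaded areas can also be computed numerically with pnorm, which gives the area under the curve up to a point. The following is a minimal sketch for the same curve (mean 40, SD 15); the names p_red and p_blue are illustrative and not part of the original lab.

```r
# Red area: P(X < 18) for a Normal curve with mean 40 and SD 15
p_red = pnorm(18, 40, 15)

# Blue area: P(60 < X < 80) is the area up to 80 minus the area up to 60
p_blue = pnorm(80, 40, 15) - pnorm(60, 40, 15)

p_red
p_blue
```

Both values should look plausible against the sketch: each shaded region sits in one tail, so each area is well below one half.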

2 Basics of Normal curve

We treat the collected heights as a sample of heights of students in U Syd. We want to fit a normal curve to the histogram of this height sample using the sample mean and sample SD.

Now, plot the histogram of height and then plot the normal curve (defined by the sample mean and sample SD) on top of it. Is the normal curve a reasonable approximation to the histogram in this example?

### Write your code here
m = mean(height)
s = sd(height)
hist(height, freq=FALSE) # freq=FALSE plots densities, so the curve is on the same scale
curve(dnorm(x, m, s), xlim = c(140, 200), add=TRUE)

Answer: Although there are some differences, the normal curve matches the shape of the histogram reasonably well.

2.1 Proportion and quantiles under normal curve

  • Under the normal curve, are there more students shorter than 160 cm than students taller than 190 cm? Write R code below to calculate this.
### Write your code here
#
pnorm(160, m, s)
## [1] 0.08464811
1 - pnorm(190, m, s)
## [1] 0.04357546

Answer: Yes, the proportion of students under 160 cm (0.085) is larger than that of students taller than 190 cm (0.044).

  • Under the normal curve, what is the 23rd percentile of the student heights? Write R code below to calculate this.
### Write your code here
#
qnorm(0.23, m, s)
## [1] 166.1809

3 Australian men’s AFL team and heights

In the Australian Football League (AFL), recruiters tend to look for tall male players. We want to use the heights of male students in MATH1005, S2 2022 as a sample to model the heights of Australian men.

3.1 Data modelling

Select the heights of male students from the data set. Plot a histogram of the selected heights. Construct a normal curve to approximate the histogram of male heights in math1005, and overlay the resulting curve on the histogram.

### Write your code here
Mselect = !is.na(math1005$Height) & math1005$Gender == "Male"
Mheight = math1005$Height[Mselect]
sum(is.na(Mheight)) # make sure there is no NA left
## [1] 0
length(Mheight) # number of data points
## [1] 167
hist(Mheight, freq=FALSE, main="male heights in MATH1005", ylim=c(0, 0.07))
#
m = mean(Mheight)
s = sd(Mheight)
curve(dnorm(x,m,s),from=150,to=200, add=TRUE)

m 
## [1] 178.8377
s
## [1] 6.533423

3.2 Using R

For each of the following questions, try to use only pnorm and qnorm to calculate the answer.

  1. According to this article, the average height of AFL players is 188cm. What is the chance of finding an Australian man taller than 188cm?
### Write your code here
pnorm(188, m, s, lower.tail = FALSE)
## [1] 0.08040242
  2. The tallest AFL player in history is Aaron Sandilands at 211cm tall. What is the chance of finding a man of height greater than Aaron Sandilands?
### Write your code here
pnorm(211, m, s, lower.tail = FALSE)
## [1] 4.267265e-07
  3. What percentage of Australian men are between 170 and 180cm?
### Write your code here
pnorm(180, m, s)
## [1] 0.570598
pnorm(170, m, s)
## [1] 0.08807664
pnorm(180, m, s) - pnorm(170, m, s)
## [1] 0.4825214
  4. What percentage of Australian men are you taller than?
### Write your code here
pnorm(177, m, s) # 177 is an example height; use your own height in cm
## [1] 0.3892476
  5. If 90% of Australian men are below a certain height, what is that height?
### Write your code here
qnorm(0.9, m, s)
## [1] 187.2106
  6. If 40% of Australian men are above a certain height, what is that height?
### Write your code here
qnorm(0.6, m, s)
## [1] 180.4929
  7. What is the interquartile range of heights of Australian men?
### Write your code here
qnorm(0.75, m, s)
## [1] 183.2445
qnorm(0.25, m, s)
## [1] 174.431
qnorm(0.75, m, s) - qnorm(0.25, m, s)
## [1] 8.813454
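
The answers above rely on two facts: an upper-tail area is one minus the lower-tail area, and qnorm is the inverse of pnorm. The following sketch checks both, assuming the mean and SD printed in Section 3.1; the names m_male, s_male and the other variables are illustrative only.

```r
# Mean and SD of male heights, copied from the output in Section 3.1
m_male = 178.8377
s_male = 6.533423

# "Taller than 188cm" two ways: upper tail directly, or 1 minus the lower tail
upper_direct = pnorm(188, m_male, s_male, lower.tail = FALSE)
upper_complement = 1 - pnorm(188, m_male, s_male)
all.equal(upper_direct, upper_complement)

# "40% of men are above" describes the same height as "60% are below"
h_lower = qnorm(0.6, m_male, s_male)
h_upper = qnorm(0.4, m_male, s_male, lower.tail = FALSE)
all.equal(h_lower, h_upper)

# qnorm undoes pnorm: the area below the 90th percentile is 0.9
area_back = pnorm(qnorm(0.9, m_male, s_male), m_male, s_male)
all.equal(area_back, 0.9)
```

Each all.equal call should return TRUE, up to floating-point tolerance.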

3.3 Standard units

Answer questions 1 to 4 above again, this time by converting the values to standard units and using the standard normal curve.

  1. What is the chance of finding an Australian man taller than 188cm?
### Write your code here
su = (188 - m)/s
pnorm(su, lower.tail = FALSE)
## [1] 0.08040242
  2. What is the chance of finding a man of height greater than 211cm?
### Write your code here
su = (211 - m)/s
pnorm(su, lower.tail = FALSE)
## [1] 4.267265e-07
  3. What percentage of Australian men are between 170 and 180cm?
### Write your code here
su1 = (180 - m)/s
su2 = (170 - m)/s
pnorm(su1) - pnorm(su2)
## [1] 0.4825214
  4. What percentage of Australian men are you taller than?
### Write your code here
su = (177 - m)/s
pnorm(su)
## [1] 0.3892476
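
The agreement between the two sets of answers is no accident: a height \(x\) in standard units is \(z = (x - m)/s\), and the area below \(x\) under the normal curve with mean \(m\) and SD \(s\) equals the area below \(z\) under the standard normal curve. The sketch below checks this for several heights at once, assuming the mean and SD printed in Section 3.1. The helper to_su and the variable names are hypothetical and not part of the original lab.

```r
# Mean and SD of male heights, copied from the output in Section 3.1
m_male = 178.8377
s_male = 6.533423

# Hypothetical helper: convert a value to standard units
to_su = function(x, mu, sigma) (x - mu) / sigma

heights = c(170, 177, 180, 188, 211)
z = to_su(heights, m_male, s_male)

# Areas on the original scale and on the standard scale agree
area_original = pnorm(heights, m_male, s_male)
area_standard = pnorm(z)
all.equal(area_original, area_standard)
```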

3.4 Calculating by hand

Use the 68%-95%-99.7% rule to calculate the following by hand.

  • What percentage of Australian men are shorter than 198.43cm? 99.85%
  • What percentage of Australian men are taller than 165.78cm? 97.5%
  • What percentage of Australian men are between 172.31cm and 191.9cm? 81.5%
  • What is the 97.5th percentile of heights of Australian men? 191.9cm
  • What is the 2.5th percentile of heights of Australian men? 165.78cm
  • Write down an interval which contains 95% of the heights. [165.78,191.9]
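
The heights in these questions appear to be whole numbers of SDs from the mean when the mean and SD are rounded to two decimal places (178.84 and 6.53). The sketch below lists those cut-offs and works through the 81.5% answer; the names m_r, s_r, cutoffs, rule_area and exact_area are illustrative only.

```r
# Mean and SD of male heights from Section 3.1, rounded to two decimals
m_r = 178.84
s_r = 6.53

# Heights at -3, -2, -1, 0, 1, 2 and 3 SDs from the mean
k = -3:3
cutoffs = round(m_r + k * s_r, 2)
cutoffs

# Between 172.31 (mean - 1 SD) and 191.9 (mean + 2 SD):
# half of the central 68% plus half of the central 95%
rule_area = 0.68 / 2 + 0.95 / 2
rule_area

# Area under the standard normal curve for comparison with the rule
exact_area = pnorm(2) - pnorm(-1)
exact_area
```

The rule gives 0.815, and the pnorm value differs from it only slightly because 68% and 95% are themselves rounded figures.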

4 Correlation coefficient

We want to explore the association between shoesize and height. Now we want to use all data points in math1005.

4.1 First attempt

  • What is the correlation coefficient of the two variables? Use cor. Hint: the argument use = "complete" makes cor ignore observations with NA values.
### Write your code here
cor(math1005$Height, math1005$ShoeSize, use = "complete")
## [1] 0.01064368
  • Produce a scatterplot using the plot function.
### Write your code here
plot(math1005$Height, math1005$ShoeSize)

  • How would you describe the association between shoesize and height? Hint: Some students might have responded using a different shoe size convention, perhaps EU instead of US. The majority of students reported US shoesize.

Answer: There is almost no association between shoesize and height. This could be caused by outliers in the data set.

4.2 Removing outliers

Since the majority of students reported US shoesize, let’s discard data points with shoesize > 20 as outliers and then repeat the above procedure. What do you find?

### Write your code here
Sselect = !is.na(math1005$ShoeSize) & !is.na(math1005$Height) & math1005$ShoeSize < 20
height = math1005$Height[Sselect]
shoe = math1005$ShoeSize[Sselect]
#
cor(height, shoe)
## [1] 0.7757101
plot(height, shoe)

Answer: Now there is a strong positive association between shoesize and height.
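
To see how a few entries on a different scale can mask a strong association, here is a small illustration. The numbers are made up and are not the MATH1005 data; the names height_demo, shoe_us and shoe_mixed are hypothetical.

```r
# Made-up illustration, not the MATH1005 data
height_demo = seq(150, 195, by = 5)        # ten heights in cm
shoe_us = 5 + 0.15 * (height_demo - 150)   # exactly linear in height

# All sizes on one scale: a perfect positive linear association
r_clean = cor(height_demo, shoe_us)
r_clean

# Suppose the two shortest people report EU-style sizes instead
shoe_mixed = shoe_us
shoe_mixed[1:2] = c(36, 37)
r_mixed = cor(height_demo, shoe_mixed)
r_mixed
```

In this toy example two mismatched entries are enough to turn a perfect positive correlation into a negative one, which is why it helps to inspect the scatterplot before interpreting the coefficient.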

4.3 More about correlation coefficient

Using the data without outliers, verify the following properties of the correlation coefficient using R.

  • Verify that the correlation coefficient is not affected by interchanging the variables (symmetry).
### Write your code here
cor(shoe, height)
## [1] 0.7757101
cor(height, shoe)
## [1] 0.7757101
  • The conversion from US shoesize to EU shoesize approximately follows the formula EU ShoeSize = US ShoeSize × 1.27 + 30. Now transform the cleaned shoesizes (assuming they are US sizes) into EU sizes. Then verify that the correlation coefficient is shift and scale invariant.
### Write your code here
EUshoe = shoe * 1.27 + 30
cor(shoe, height)
## [1] 0.7757101
cor(EUshoe, height)
## [1] 0.7757101
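
One way to see why the coefficient is shift and scale invariant is to write it in terms of standard units: adding a constant or multiplying by a positive constant leaves the standard units unchanged. The sketch below uses made-up values rather than the MATH1005 data, and all variable names are illustrative. Note that scale() standardises with the sample SD, which is why the sum is divided by n - 1.

```r
# Made-up values for illustration only
us = c(7, 8, 9.5, 10, 11, 12)
ht = c(160, 165, 172, 175, 181, 186)
eu = us * 1.27 + 30   # conversion formula from the exercise

# Standard units are unchanged by a shift and a positive rescaling
z_us = as.numeric(scale(us))
z_eu = as.numeric(scale(eu))
all.equal(z_us, z_eu)

# cor() equals the sum of products of standard units divided by (n - 1)
z_ht = as.numeric(scale(ht))
r_manual = sum(z_us * z_ht) / (length(us) - 1)
all.equal(r_manual, cor(us, ht))

# Hence the coefficient is the same for US and EU sizes
all.equal(cor(us, ht), cor(eu, ht))
```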